Fundamental Techniques in Data Science with R

Today

This lecture

  • Leftover slides from last week

  • Data manipulation

  • Basic analysis (correlation & t-test)

  • Pipes

  • Deviations and modeling

Some programming tips:

  • keep your code tidy
  • use comments (text preceded by #) to clarify what you are doing
    • If you look at your code again a month from now, you will not remember what you did unless you use comments
  • when working with functions, use the TAB key to quickly see a function’s arguments and their help
  • work with logically named R-scripts
    • indicate the sequential nature of your work
  • work with RStudio projects
  • if allowed, place your project folders in some cloud-based environment

Functions

Functions have parentheses (). A name directly followed by parentheses always indicates a function call. For example:

  • matrix() is a function
  • c() is a function
  • but (1 - 2) * 5 is a calculation, not a function
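
For instance, a minimal illustration in base R:

matrix(1:6, nrow = 2)  # a function call: builds a 2 x 3 matrix
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
c(1, 2, 3)             # a function call: combines values into a vector
## [1] 1 2 3
(1 - 2) * 5            # parentheses only group the calculation
## [1] -5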

Packages

Packages give additional functionality to R.

By default, some packages are included. These packages allow you to do mainstream statistical analyses and data manipulation. Installing additional packages allows you to perform state-of-the-art statistical programming and estimation.

The cool thing is that these packages are all developed by users. New developments therefore reach users very quickly:

  • newly developed functions and software are readily available
  • this is different from other mainstream software, like SPSS, where new methodology may take years to be implemented.

A list of available packages can be found on CRAN

Loading packages

There are two ways to load a package in R

library(stats)

and

require(stats)
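
The difference: library() stops with an error when the package is not installed, while require() returns FALSE (with a warning) and lets the script continue. A common pattern, shown here with the mice package as an example:

if (!require(mice)) {        # require() returns TRUE or FALSE
  install.packages("mice")   # install the package only when it is missing
  library(mice)              # then load it
}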

Installing packages

The easiest way to install a package, e.g. mice, is to use

install.packages("mice")

Alternatively, you can also do it in RStudio through

Tools --> Install Packages
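
install.packages() also accepts a character vector, so the packages used later in this lecture can be installed in one call:

install.packages(c("MASS", "mice", "dplyr", "haven", "magrittr"))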

R in depth

Workspaces and why you should sometimes save them

A workspace contains all the objects you have created or changed in your R session.

A saved workspace contains everything exactly as it was at the moment you saved it.

You do not need to run all the previous code again if you would like to continue working at a later time.

  • You can save the workspace and continue exactly where you left off.

Workspaces are compressed and require relatively little memory when stored. The compression is very efficient and beats reloading large datasets.
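
A minimal sketch (the file name is just an example):

save.image("lecture2.RData")  # save the complete workspace to a file
# ... quit R, come back later ...
load("lecture2.RData")        # restore all objects exactly as they were saved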

History and why it is useful

R by default saves (part of) the code history and RStudio expands this functionality greatly.

It is often useful to look back at the code history, for various reasons.

  • There are multiple ways to access the code history.

    1. Use arrow up in the console. This allows you to go back in time, one line of code at a time. Extremely useful for going back to previous lines and making minor alterations to the code.
    2. Use the history tab in the environment pane. The complete project history can be found here and the history can be searched. This is particularly convenient when you know what code you are looking for.
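
From the console, base R’s history functions can also be used (the file name is just an example); note that RStudio manages its own history on top of this:

history(25)                       # show the last 25 lines of code
savehistory("lecture2.Rhistory")  # write the history to a file
loadhistory("lecture2.Rhistory")  # read it back in a later session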

Working in projects in RStudio

  • Every project has its own history
  • Every research project has its own project
  • Every project can have its own folder, which also serves as a research archive
  • Every project can have its own version control system
  • RStudio projects can be linked to Git (or other remote) repositories

In general…

  • Use common sense and BE CONSISTENT.

  • Browse through the tidyverse style guide

    • The point of having style guidelines is to have a common vocabulary of coding
    • so people can concentrate on what you are saying, rather than on how you are saying it.
  • If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers and collaborators out of their rhythm when they go to read it. Try to avoid this.

  • Intentional spacing makes your code easier to interpret

    • a<-c(1,2,3,4,5) vs.
    • a <- c(1, 2, 3, 4, 5)
  • at least put a space after every comma!

New packages we use

library(MASS)     # for the cats data
library(dplyr)    # data manipulation
library(haven)    # in/exporting data
library(magrittr) # pipes
library(mice)     # for the nhanes data

New functions

  • transform(): changing and adding columns
  • dplyr::filter(): row-wise selection (of cases)
  • table(): frequency tables
  • class(): object class
  • levels(): levels of a factor
  • order(): data entries in increasing order
  • haven::read_sav(): import SPSS data
  • cor(): bivariate correlation
  • sample(): drawing a sample
  • t.test(): t-test

Data manipulation

The cats data

head(cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
str(cats)
## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

How to get only Female cats?

fem.cats <- cats[cats$Sex == "F", ]
dim(fem.cats)
## [1] 47  3
head(fem.cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6

How to get only heavy cats?

heavy.cats <- cats[cats$Bwt > 3, ]
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

How to get only heavy cats?

heavy.cats <- subset(cats, Bwt > 3)
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

Another way: dplyr

filter(cats, Bwt > 2, Bwt < 2.2, Sex == "F")
##   Sex Bwt Hwt
## 1   F 2.1 7.2
## 2   F 2.1 7.3
## 3   F 2.1 7.6
## 4   F 2.1 8.1
## 5   F 2.1 8.2
## 6   F 2.1 8.3
## 7   F 2.1 8.5
## 8   F 2.1 8.7
## 9   F 2.1 9.8

Working with factors

class(cats$Sex)
## [1] "factor"
levels(cats$Sex)
## [1] "F" "M"

Working with factors

levels(cats$Sex) <- c("Female", "Male")
table(cats$Sex)
## 
## Female   Male 
##     47     97
head(cats)
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.1 7.2
## 5 Female 2.1 7.3
## 6 Female 2.1 7.6

Sorting

sorted.cats <- cats[order(cats$Bwt), ]
head(sorted.cats)
##       Sex Bwt Hwt
## 1  Female 2.0 7.0
## 2  Female 2.0 7.4
## 3  Female 2.0 9.5
## 48   Male 2.0 6.5
## 49   Male 2.0 6.5
## 4  Female 2.1 7.2
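
dplyr offers arrange() for the same purpose; desc() sorts in decreasing order:

sorted.cats <- arrange(cats, Bwt)        # lightest cats first
heavy.first <- arrange(cats, desc(Bwt))  # heaviest cats first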

Combining matrices or dataframes

cats.numbers <- cbind(Weight = cats$Bwt, HeartWeight = cats$Hwt)
head(cats.numbers)
##      Weight HeartWeight
## [1,]    2.0         7.0
## [2,]    2.0         7.4
## [3,]    2.0         9.5
## [4,]    2.1         7.2
## [5,]    2.1         7.3
## [6,]    2.1         7.6

Combining matrices or dataframes

rbind(cats[1:3, ], cats[1:5, ])
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.0 7.0
## 5 Female 2.0 7.4
## 6 Female 2.0 9.5
## 7 Female 2.1 7.2
## 8 Female 2.1 7.3
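
dplyr provides bind_rows() as an alternative to rbind() for stacking data frames (row names are reset in the result):

bind_rows(cats[1:3, ], cats[1:5, ])  # stacks the same rows as the rbind() call above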

Basic analysis

Correlation

cor(cats[, -1])
##           Bwt       Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000

With [, -1] we exclude the first column
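
Equivalently, the columns can be selected by name, which is often easier to read:

cor(cats[, c("Bwt", "Hwt")])
##           Bwt       Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000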

Correlation

cor.test(cats$Bwt, cats$Hwt)
## 
##  Pearson's product-moment correlation
## 
## data:  cats$Bwt and cats$Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274

What do we conclude?

Correlation

plot(cats$Bwt, cats$Hwt)

T-test

Test the null hypothesis that the difference in mean heart weight between male and female cats is 0

t.test(formula = Hwt ~ Sex, data = cats)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

T-test

plot(formula = Hwt ~ Sex, data = cats)

Pipes

This is a pipe:

boys <- 
  read_sav("boys.sav") %>%
  head()

It effectively replaces head(read_sav("boys.sav")).

Why are pipes useful?

Let’s assume that we want to load data, change a variable, filter cases and select columns. Without a pipe, this would look like

boys  <- read_sav("boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))

With the pipe:

boys <-
  read_sav("boys.sav") %>%
  transform(hgt = hgt/100) %>%
  filter(age > 15) %>%
  subset(select = c(hgt, wgt, bmi))

Benefit: only a single object in memory, and the sequence of steps is easy to interpret

With pipes

Your code becomes more readable:

  • data operations are structured from left-to-right and not from in-to-out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • easy to add steps in the sequence

What do pipes do:

  • f(x) becomes x %>% f()
rnorm(10) %>% mean()
## [1] -0.2342249
  • f(x, y) becomes x %>% f(y)
boys %>% cor(use = "pairwise.complete.obs")
##           hgt       wgt       bmi
## hgt 1.0000000 0.6100784 0.1758781
## wgt 0.6100784 1.0000000 0.8841304
## bmi 0.1758781 0.8841304 1.0000000
  • h(g(f(x))) becomes x %>% f %>% g %>% h
boys %>% subset(select = wgt) %>% na.omit() %>% max()
## [1] 117.4

Useful: outlier filtering

nrow(cats)
## [1] 144
cats.outl <- 
  cats %>% 
  filter(Hwt < mean(Hwt) + 3 * sd(Hwt), 
         Hwt > mean(Hwt) - 3 * sd(Hwt))

nrow(cats.outl)
## [1] 143
cats %>%  
  filter(Hwt > mean(Hwt) + 3 * sd(Hwt))
##    Sex Bwt  Hwt
## 1 Male 3.9 20.5

More pipe stuff

The standard %>% pipe


The %$% pipe


The %T>% pipe

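In short: %>% passes the left-hand side on as (by default) the first argument of the next function, %$% exposes the columns of the left-hand side by name, and %T>% passes the left-hand side itself on, which is useful for steps that are run only for their side effect, such as plotting. A minimal sketch of %T>%:

cats %T>%
  plot() %>%    # plot() is called only for its side effect
  summary()     # summary() still receives cats, not the result of plot()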

The role of . in a pipe

In a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.

cats %>%
  plot(Hwt ~ Bwt, data = .)

VS

cats %$%
  plot(Hwt ~ Bwt)

The . can be used as a placeholder in the pipe.

Performing a t-test in a pipe

cats %$%
  t.test(Hwt ~ Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

is the same as

t.test(Hwt ~ Sex, data = cats)

Storing a t-test from a pipe

cats.test <- cats %$%
  t.test(Bwt ~ Sex)

cats.test
## 
##  Welch Two Sample t-test
## 
## data:  Bwt by Sex
## t = -8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6631268 -0.4177242
## sample estimates:
## mean in group Female   mean in group Male 
##             2.359574             2.900000

Squared deviations

We have already met deviations today:

  • correlations

\[\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{\mathrm{E}[(X - \mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}.\]

  • t-test

\[t = \frac{\bar{X}-\mu}{\hat{\sigma}/\sqrt{n}}.\]

  • variance

\[\sigma^2_X = \mathrm{E}[(X - \mu)^2].\]

Can you identify which parts of these equations are the deviations?

Deviations

Deviations tell us how far each value (observation) lies from a comparison value.

  • often, the mean is chosen as a comparison value.

Why the mean?

The arithmetic mean is a very informative measure:

  • it is the average
  • it is the mathematical expectation
  • it is the central value of a set of discrete numbers

The mean itself is a model: observations are merely deviations from that model.
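
For instance, deviations from the mean always sum to (numerically) zero; with the cats’ body weights:

dev <- cats$Bwt - mean(cats$Bwt)  # deviation of every observation from the mean
sum(dev)                          # zero, up to floating-point rounding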

The mean as a center

Deviations from the mean

Plotting the deviations

Use of deviations

Deviations summarize the fit of all the points in the data to a single point

The mean is the mathematical expectation. It represents the observed values best for a normally distributed univariate set.

  • The mean yields the lowest set of deviations
# plotdata: a data frame with a numeric column X (used in the deviation plots)
plotdata %>%
  mutate("Mean" = X - mean(X), 
         "Mean + 3" = X - (mean(X) + 3)) %>%
  select("Mean", "Mean + 3") %>%
  colSums() %>%
  round(digits = 3)
##     Mean Mean + 3 
##        0     -300
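
The same holds, even more strongly, for squared deviations: any comparison value other than the mean gives a larger sum of squared deviations. A sketch with the cats data:

X <- cats$Hwt
sum((X - mean(X))^2)        # sum of squared deviations around the mean
sum((X - (mean(X) + 3))^2)  # larger, because the comparison value is off by 3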

The mean minimizes the deviations

What happens

Plotting the standardized deviations

Plotting the squared deviations

Why squared deviations are useful

Throughout statistics we make extensive use of squaring.

What are the useful properties of squaring that statisticians are so fond of?

Deviations from the mean

Least squares solution

What’s next

During this week’s practical exercises we will calculate and explore squared deviations, and you will learn how to minimize them.

Next week we’ll jump to regression. We’ll see how deviations still play a role in that framework:

  • We minimize deviations to infer a linear model
  • If our model is not perfect, observations will still deviate from the model